We use a two-component hurdle model: the first component predicts whether a disease will be present (binary) and, if it is present, the second component predicts the case count (integer). Here we compare the results of a boosted tree model (xgboost) to those of our baseline model.
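
The two-stage logic described above can be sketched as follows. This is a minimal illustration in Python, not the report's implementation: the 0.5 threshold, the rounding of the count, and both toy model functions are assumptions made for the example. Only the feature names disease_status_lag1 and cases_lag1 come from the report.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass
class HurdlePrediction:
    disease_present: bool
    cases: int


def hurdle_predict(
    row: Mapping[str, float],
    status_model: Callable[[Mapping[str, float]], float],
    count_model: Callable[[Mapping[str, float]], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the report
) -> HurdlePrediction:
    """Stage 1 decides presence; stage 2 runs only if stage 1 says present."""
    if status_model(row) < threshold:
        return HurdlePrediction(disease_present=False, cases=0)
    predicted = count_model(row)
    # Rounding to a non-negative integer is an assumption of this sketch.
    return HurdlePrediction(disease_present=True, cases=max(0, round(predicted)))


# Hypothetical stand-ins for the two fitted components.
def toy_status_model(row: Mapping[str, float]) -> float:
    return 0.9 if row["disease_status_lag1"] == 1 else 0.1


def toy_count_model(row: Mapping[str, float]) -> float:
    return row["cases_lag1"] * 1.1


present = hurdle_predict(
    {"disease_status_lag1": 1, "cases_lag1": 20}, toy_status_model, toy_count_model
)
absent = hurdle_predict(
    {"disease_status_lag1": 0, "cases_lag1": 0}, toy_status_model, toy_count_model
)
print(present)
print(absent)
```

The count component is never consulted for rows that fail the first stage, which is why the two components are evaluated separately in the sections that follow.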

Disease Status

disease status confusion matrix

.metric   model     full_model
accuracy  baseline  0.81
accuracy  xgboost   0.89
kap       baseline  0.25
kap       xgboost   0.67
sens      baseline  0.99
sens      xgboost   0.98
spec      baseline  0.19
spec      xgboost   0.62

Metric descriptions:
accuracy: the proportion of the data that are predicted correctly.
kap: a measure similar to accuracy(), but normalized by the accuracy that would be expected by chance alone; very useful when one or more classes have large frequency distributions.
sens: the proportion of disease absent predictions out of the number of events which were actually absent.
spec: the proportion of disease present predictions out of the number of events which were actually present.

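
To make the four metrics concrete, the sketch below computes them from a 2x2 confusion matrix in Python, with "disease absent" treated as the event class as in the descriptions above. The counts are invented for illustration and are not the report's data; they are chosen only to show how a classifier that nearly always predicts "absent" can reach high accuracy and sensitivity while kappa and specificity stay low.

```python
def binary_metrics(absent_as_absent: int, absent_as_present: int,
                   present_as_absent: int, present_as_present: int) -> dict:
    """Accuracy, Cohen's kappa, sensitivity and specificity.

    Argument names read as <actual>_as_<predicted>; "absent" is the event class.
    """
    n = absent_as_absent + absent_as_present + present_as_absent + present_as_present
    accuracy = (absent_as_absent + present_as_present) / n
    sens = absent_as_absent / (absent_as_absent + absent_as_present)
    spec = present_as_present / (present_as_present + present_as_absent)

    # Agreement expected by chance: product of actual and predicted
    # marginal proportions, summed over both classes.
    actual_absent = (absent_as_absent + absent_as_present) / n
    predicted_absent = (absent_as_absent + present_as_absent) / n
    chance = (actual_absent * predicted_absent
              + (1 - actual_absent) * (1 - predicted_absent))
    kap = (accuracy - chance) / (1 - chance)
    return {"accuracy": accuracy, "kap": kap, "sens": sens, "spec": spec}


# Hypothetical counts: 800 actual absences, 200 actual presences.
metrics = binary_metrics(
    absent_as_absent=792, absent_as_present=8,
    present_as_absent=160, present_as_present=40,
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.832 and sensitivity 0.99, yet specificity is only 0.2 and kappa roughly 0.27, the same qualitative pattern as the baseline rows above.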
disease status confusion matrix by taxa

.metric   model     birds  buffaloes  camelidae  cats  cattle  cervidae  dogs  equidae  hares/rabbits  sheep/goats  swine
accuracy  baseline  0.82   0.710      0.760      0.70  0.80    0.680     0.68  0.88     0.820          0.82         0.84
accuracy  xgboost   0.88   0.870      0.870      0.88  0.88    0.860     0.87  0.92     0.880          0.90         0.90
kap       baseline  0.23   0.089      0.083      0.19  0.34    0.041     0.26  0.26     0.083          0.26         0.25
kap       xgboost   0.61   0.690      0.640      0.73  0.68    0.690     0.72  0.62     0.510          0.67         0.64
sens      baseline  0.99   1.000      1.000      1.00  0.99    1.000     0.99  1.00     0.990          0.99         0.99
sens      xgboost   0.97   0.970      0.980      0.96  0.97    0.940     0.91  0.99     0.990          0.98         0.98
spec      baseline  0.17   0.067      0.059      0.15  0.27    0.031     0.24  0.17     0.060          0.19         0.18
spec      xgboost   0.56   0.680      0.590      0.74  0.67    0.720     0.80  0.50     0.420          0.62         0.57

disease status confusion matrix by continent

.metric   model     Africa  Americas  Asia  Europe  Oceania
accuracy  baseline  0.80    0.76      0.81  0.84    0.890
accuracy  xgboost   0.88    0.87      0.90  0.90    0.940
kap       baseline  0.29    0.21      0.26  0.27    0.034
kap       xgboost   0.64    0.69      0.71  0.62    0.570
sens      baseline  0.99    1.00      0.99  0.99    1.000
sens      xgboost   0.97    0.97      0.97  0.98    1.000
spec      baseline  0.23    0.16      0.19  0.20    0.019
spec      xgboost   0.60    0.67      0.68  0.56    0.450

disease status direction change confusion matrix

.metric   model     full_model
accuracy  baseline  0.81
accuracy  xgboost   0.89
kap       baseline  -0.01
kap       xgboost   0.45

Metric descriptions:
accuracy: the proportion of the data that are predicted correctly.
kap: a measure similar to accuracy(), but normalized by the accuracy that would be expected by chance alone; very useful when one or more classes have large frequency distributions.

Note that the baseline produces some “outbreak ends” predictions. These occur when the lag1 disease status is 1 but the lag1 cases are 0 or NA. The predict() function predicts lag1 cases only when the lag1 disease status is 1.
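
One way to read the note above is sketched below in Python. It assumes, without confirmation from the report, that the baseline carries the lag1 cases forward and then derives the predicted disease status from whether those cases are positive. The function names and the labels "outbreak starts" and "no change" are hypothetical; only "outbreak ends" appears in the report.

```python
from typing import Optional, Tuple


def baseline_predict(status_lag1: int,
                     cases_lag1: Optional[float]) -> Tuple[int, Optional[float]]:
    """Return (predicted_status, predicted_cases) under the assumed baseline rule."""
    if status_lag1 == 1:
        predicted_cases = cases_lag1  # carried forward; may be 0 or None (NA)
    else:
        predicted_cases = 0
    # Assumption: status is "present" only when the carried-forward count is positive.
    has_cases = predicted_cases is not None and predicted_cases > 0
    return (1 if has_cases else 0), predicted_cases


def direction_change(status_lag1: int, predicted_status: int) -> str:
    """Label the transition from the previous status to the predicted one."""
    if status_lag1 == 1 and predicted_status == 0:
        return "outbreak ends"
    if status_lag1 == 0 and predicted_status == 1:
        return "outbreak starts"  # hypothetical label
    return "no change"  # hypothetical label


for lag_status, lag_cases in [(1, 5), (1, 0), (1, None), (0, 0)]:
    status, cases = baseline_predict(lag_status, lag_cases)
    print(lag_status, lag_cases, "->", direction_change(lag_status, status))
```

Under this reading, a row with lag1 status 1 and lag1 cases of 0 or NA is labeled "outbreak ends" even though the baseline otherwise repeats the previous semester.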

disease status direction change confusion matrix by taxa

.metric   model     birds   buffaloes  camelidae  cats    cattle  cervidae  dogs    equidae  hares/rabbits  sheep/goats  swine
accuracy  baseline  0.820   0.710      0.760      0.700   0.800   0.680     0.680   0.880    0.8200         0.8200       0.8400
accuracy  xgboost   0.880   0.870      0.870      0.880   0.880   0.860     0.870   0.920    0.8800         0.9000       0.9000
kap       baseline  -0.016  -0.044     -0.013     -0.056  -0.015  -0.017    -0.027  0.011    0.0029         -0.0084      0.0023
kap       xgboost   0.300   0.550      0.470      0.610   0.450   0.590     0.630   0.350    0.2400         0.4600       0.4100

disease status direction change confusion matrix by continent

.metric   model     Africa  Americas  Asia    Europe  Oceania
accuracy  baseline  0.8000  0.760     0.810   0.840   0.890
accuracy  xgboost   0.8800  0.870     0.900   0.900   0.940
kap       baseline  0.0011  -0.028    -0.023  0.011   -0.017
kap       xgboost   0.3900  0.500     0.510   0.380   0.270

disease status variable importance and partial dependency (xgboost only)
disease status partial dependency of disease_status_lag1 by select disease (xgboost only)
disease status partial dependency of ever_in_country_any_taxa by select disease (xgboost only)
disease status partial dependency of ever_in_country_given_taxa by select disease (xgboost only)
disease status partial dependency of cases_lag1_missing by select disease (xgboost only)
disease status partial dependency of log_human_population by select disease (xgboost only)
disease status partial dependency of log_taxa_population by select disease (xgboost only)
disease status partial dependency of disease_rabies by select disease (xgboost only)
disease status partial dependency of log_gdp_per_capita by select disease (xgboost only)
disease status partial dependency of cases_lag1 by select disease (xgboost only)
disease status partial dependency of cases_lag_sum_border_countries by select disease (xgboost only)
disease status partial dependency of disease_leptospirosis by select disease (xgboost only)
disease status partial dependency of first_reporting_semester by select disease (xgboost only)
disease status partial dependency of disease_status_lag1 by select direction change (xgboost only)
disease status partial dependency of ever_in_country_any_taxa by select direction change (xgboost only)
disease status partial dependency of ever_in_country_given_taxa by select direction change (xgboost only)
disease status partial dependency of cases_lag1_missing by select direction change (xgboost only)
disease status partial dependency of log_human_population by select direction change (xgboost only)
disease status partial dependency of log_taxa_population by select direction change (xgboost only)
disease status partial dependency of disease_rabies by select direction change (xgboost only)
disease status partial dependency of log_gdp_per_capita by select direction change (xgboost only)
disease status partial dependency of cases_lag1 by select direction change (xgboost only)
disease status partial dependency of cases_lag_sum_border_countries by select direction change (xgboost only)
disease status partial dependency of disease_leptospirosis by select direction change (xgboost only)
disease status partial dependency of first_reporting_semester by select direction change (xgboost only)
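
For readers unfamiliar with partial dependency plots, the underlying computation can be sketched as below. This is a generic, pure-Python illustration and not the code used to produce the figures listed above: toy_predict is a hypothetical stand-in for the fitted xgboost model, and the three rows are invented. The feature names are taken from the figure titles.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def partial_dependence(predict: Callable[[Row], float], rows: Sequence[Row],
                       feature: str, grid: Sequence[float]) -> List[float]:
    """Average prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        # Overwrite the feature for every row, leaving other features as observed.
        modified = [{**row, feature: value} for row in rows]
        curve.append(mean(predict(row) for row in modified))
    return curve


def toy_predict(row: Row) -> float:
    """Hypothetical model: probability that the disease is present."""
    if row["disease_status_lag1"] == 1:
        return 0.8
    return 0.1 + 0.05 * row["ever_in_country_any_taxa"]


rows = [
    {"disease_status_lag1": 0, "ever_in_country_any_taxa": 0},
    {"disease_status_lag1": 0, "ever_in_country_any_taxa": 1},
    {"disease_status_lag1": 1, "ever_in_country_any_taxa": 1},
]
curve = partial_dependence(toy_predict, rows, "disease_status_lag1", grid=[0, 1])
print(curve)
```

A "by select disease" or "by select direction change" panel would presumably correspond to calling the same function on the subset of rows for one disease or one direction-change class, though the report does not state how the subsets were formed.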

Cases

Here we evaluate the subset of the training data with positive case counts.

cases model stats

## # A tibble: 6 x 4
##   model    .metric .estimator   .estimate
##   <chr>    <chr>   <chr>            <dbl>
## 1 baseline rmse    standard   485129.    
## 2 xgboost  rmse    standard   393058.    
## 3 baseline rsq     standard        0.0179
## 4 xgboost  rsq     standard        0.122 
## 5 baseline mae     standard     7545.    
## 6 xgboost  mae     standard     5853.
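
The three statistics in the table above can be reproduced by hand as sketched below in Python. The sketch assumes the table came from yardstick-style metric functions, where rsq is the squared Pearson correlation between truth and estimate rather than one minus a ratio of sums of squares. The numbers are invented and include one very large observation to mimic a skewed case-count distribution.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def rmse(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Root mean squared error."""
    return sqrt(mean((t - e) ** 2 for t, e in zip(truth, estimate)))


def mae(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean absolute error."""
    return mean(abs(t - e) for t, e in zip(truth, estimate))


def rsq(truth: Sequence[float], estimate: Sequence[float]) -> float:
    """Squared Pearson correlation (assumed definition, see lead-in)."""
    mean_t, mean_e = mean(truth), mean(estimate)
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(truth, estimate))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov ** 2 / (var_t * var_e)


# Hypothetical case counts: three small outbreaks and one very large one.
truth = [1, 2, 3, 1000]
estimate = [2, 3, 4, 10]
print("rmse:", rmse(truth, estimate))
print("mae: ", mae(truth, estimate))
print("rsq: ", rsq(truth, estimate))
```

Because RMSE squares each error before averaging, a single large miss pulls it well above MAE, which is worth keeping in mind when reading the gap between the rmse and mae rows.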
cases residuals
cases residuals by taxa
cases residuals by continent
cases variable importance and partial dependency (xgboost only)
cases partial dependency of cases_lag1 by select disease (xgboost only)
cases partial dependency of log_taxa_population by select disease (xgboost only)
cases partial dependency of log_veterinarians_per_taxa by select disease (xgboost only)
cases partial dependency of log_human_population by select disease (xgboost only)
cases partial dependency of log_gdp_per_capita by select disease (xgboost only)
cases partial dependency of cases_lag_sum_border_countries by select disease (xgboost only)
cases partial dependency of disease_rabies by select disease (xgboost only)
cases partial dependency of taxa_birds by select disease (xgboost only)
cases partial dependency of report_semester_1 by select disease (xgboost only)
cases partial dependency of continent_Asia by select disease (xgboost only)
cases partial dependency of disease_avian_infectious_bronchitis by select disease (xgboost only)
cases partial dependency of disease_infectious_bursal_disease by select disease (xgboost only)